Recruitment and Intelligent System


The Career Centre provides information, analytical and organizational support for the job placement of
students and graduates. An information system supporting all of its main activities has been developed.
At present the system strengthens links between students and companies as a repository of CVs and
vacancies. Beyond that, the system should act as a virtual recruiter that takes into account the student's
personal abilities and preferences, available jobs, company profiles, local labour market infrastructure,
industrial and technological trends, job specifications and available human resources in order to support
effective employment decisions. This paper presents an intelligent management system based on text
mining methods for supporting recruitment services.


Introduction 


One of the most alarming social problems in Ukraine today is formal unemployment among
young people. Even after receiving a university degree, young professionals can seldom find
jobs that match their specialities. Closer co-operation between universities and enterprises is
needed during the strategic planning of new high-tech positions in companies. Such co-operation
would benefit both sides and would allow students to obtain high-quality positions in the private
and public sectors. The Career Centre provides two main classes of services: helping educated
professionals find an appropriate job, and helping companies find the right professionals to fill
available jobs.

The University Career Centre advises students and proposes solutions, taking into account
their personal abilities and preferences, available jobs, job specifications, available human
resources, applicants' profiles based on university diplomas, university and company profiles,
national educational policy and standards, local labour market infrastructure, and industrial and
technological trends. The Centre also carries out a great deal of labour market research: analysis
of student placements at practice bases, analysis of the employment of specialists and masters at
enterprises, weekly statistics of student employment, and analysis of trends in labour demand by
speciality. In other words, the Career Centre conducts analytical research and proposes solutions
for improving student competitiveness (fig. 1). This is not possible without an appropriate modern
analytical information system.

Figure 1 — Stages of competitiveness growth

One of the system modules is a virtual consultant for employment.

Describing a CV is a challenge. Each person is individual and presents information in their
own way. The manager of the University Career Centre looks through and analyses tens of student
resumes every day and faces widely varying writing styles. The formatting, fonts and logical
structure of the CVs are completely arbitrary. Moreover, some companies have their own CV
structure, while others want to review the style and logic of CVs written by students who have
little experience in writing them. Systems that take over part of the manager's most routine work
in processing CVs and company vacancies are therefore in demand. The main features of such a
system are automatic gathering of information about candidates and company vacancies, clustering
of CVs and vacancies, and automatic matching of the most suitable vacancies to the corresponding
CV. In other words, the system must work as a virtual intelligent web consultant for the student.
Note that companies do select student CVs according to certain criteria, but this is a subjective
approach.

At present, comparing new CVs with existing ones and classifying CVs is done entirely by
hand. The number of CVs grows (as does the number of copies of the same candidate's CV) while
the admissible time for processing them shrinks. As a result, the whole stream of arriving CVs
cannot be processed manually. Creating an intelligent web system that acts as a virtual
recruitment consultant makes it possible to solve the following problems:

1. Analysis of the structure of the CV and recognition of its fields for
formalized representation.

2. Automatic gathering of CVs and vacancies from certain web sites and their
addition to the database.

3. Classification of all CVs and vacancies by subject.

4. Elimination of duplicates.

5. Flexible search according to user queries.

6. Ranking of resumes and vacancies within a group, taking into account the existing
hierarchy of the subject domain and using a matrix of skills and abilities.

7. Matching of vacancies to the most suitable student CVs.

8. Annotation of CVs and CV groups.


1. Overview 


Part of the information about a candidate is currently entered manually, which
leads to information distortion. However, automatic information extraction is not always
correct either, so a fully automatic mode does not suit this problem. An automated mode
with manual confirmation is more suitable. In the majority of cases automatic processing
yields good results, but when the system cannot correctly distinguish some parts of a
resume the manager marks them up manually. The system thus receives one more example for
the training sample, which will be used in the next training phase. The system should also
be able to check the consistency of a resume, for example by checking whether job periods
at different places overlap and whether the skills listed in the CV agree with the
descriptions of real projects. As its basic function the system should classify arriving
resumes and vacancies into certain groups and update the database.
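The consistency check on job periods mentioned above can be sketched as follows. This is a minimal illustration, assuming that each job has already been extracted as a `(name, start, end)` tuple; that layout and the function name `overlapping_periods` are hypothetical and not taken from the system itself.

```python
from datetime import date
from itertools import combinations


def overlapping_periods(jobs):
    """Return pairs of job names whose employment periods intersect.

    `jobs` is a list of (name, start, end) tuples (an assumed layout).
    """
    conflicts = []
    for (name_a, start_a, end_a), (name_b, start_b, end_b) in combinations(jobs, 2):
        # Two closed intervals intersect when each starts before the other ends.
        if start_a <= end_b and start_b <= end_a:
            conflicts.append((name_a, name_b))
    return conflicts


jobs = [
    ("Company A", date(2005, 1, 1), date(2006, 6, 30)),
    ("Company B", date(2006, 3, 1), date(2007, 1, 31)),
    ("Company C", date(2007, 2, 1), date(2008, 4, 1)),
]
print(overlapping_periods(jobs))  # [('Company A', 'Company B')]
```

A reported pair is not necessarily an error (a candidate may hold two jobs at once), so in the automated mode described above it would more naturally be shown to the manager for confirmation.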

There is no fixed unified template for CVs; therefore we generally assume that
candidates and companies send resumes and vacancies in any form. A candidate's CV
always consists of several parts; in other words, it has a logical structure. The logical
blocks usually have names, which makes it possible to locate them in the text for text
clustering methods [1-4]. Although CV writing styles vary strongly, general blocks can be
extracted, such as Surname, Sex, Date of birth, Descriptions of previous jobs etc. We
therefore define a common set of attributes that may be present in the majority of CVs.

Shatovska T., Kamenieva I. «Штучний інтелект» 4’2008


Figure 2 — Text information analysis


To keep the module flexible, fixed templates and rules for data extraction from CVs and
vacancies are not used. Instead, a model of the CV and of the vacancy is created, and each resume
is fitted to the constructed CV model. For some final objects a set of information extraction
rules has to be created. If some blocks are distinguished incorrectly, the module switches to
training mode and creates an additional extraction rule, which is written to the knowledge base.
When new CV blocks appear (for example, information about additional interests), the new elements
are added in model editing mode and an extraction rule is constructed. All CVs are then updated
with a link to the new property. The module can be adjusted to extract any data from the job
placement area: resumes, the analysis of questionnaires etc.


2. Method descriptions 
2.1. Curriculum Vitae Clustering 


Clustering is the process of grouping data into classes or clusters so that objects
within a cluster are highly similar to one another but very dissimilar to objects in other
clusters. The list of classes is defined in advance and includes all the directions of
student education at our University: management, mobile communication, computer science,
radioelectronics etc. After processing, each resume is presented as a key-value scheme:
$R = \{r_i\}$, where $r_i$ is a resume, $r_i = \{\langle key, value \rangle\}$,
$i = 1 \ldots n$, and $n$ is the number of attributes.

The description model is the same for all CVs. For each cluster a rule of resume
frequency in the given group was defined as $F = \{f_1, \ldots, f_m\}$.

Applying these conditions, we obtain a set of intersecting subsets
$C_i \cap C_j \neq \emptyset$, where $C_j = f_j(r_i)$, as shown in Fig. 3.
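To make the idea of intersecting initial subsets concrete, here is a small sketch. It assumes that each rule $f_j$ is a simple predicate over the key-value representation of a resume; the attribute names, the keyword-based rules and the function `assign_clusters` are hypothetical illustrations rather than the rules used in the system.

```python
def assign_clusters(resume, rules):
    """Return the names of all clusters whose rule accepts the resume."""
    return {name for name, rule in rules.items() if rule(resume)}


# A resume r_i as a set of <key, value> pairs (attribute names are hypothetical).
resume = {"speciality": "computer science", "skills": "java sql networks"}

# One membership rule f_j per predefined cluster; keyword rules are an assumption.
rules = {
    "computer science": lambda r: "java" in r["skills"].split(),
    "mobile communication": lambda r: "networks" in r["skills"].split(),
    "management": lambda r: "marketing" in r["skills"].split(),
}

print(sorted(assign_clusters(resume, rules)))
# ['computer science', 'mobile communication']
```

Because one resume can satisfy several rules, the resulting subsets may intersect, which is the situation the initial clustering has to resolve later.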

The TF-IDF measure was applied to each attribute [2], [3]. Each CV or vacancy $d$
is considered to be a weighted vector in the term space, so each document (vacancy or
CV) can be represented as
$d_{tfidf} = (tf_1 \log(n/df_1), tf_2 \log(n/df_2), \ldots, tf_m \log(n/df_m))$,
where $tf_i$ is the frequency of the $i$-th term in the document, $df_i$ is the number of
documents that contain the $i$-th term, $n$ is the total number of documents in the sample
and $m$ is the number of terms. Each document vector should be normalised so that
$\|d_{tfidf}\|_2 = 1$.
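The weighting and normalisation described above can be sketched in a few lines. This is a minimal illustration that assumes the documents are already tokenised into lists of terms; the function name `tfidf_vectors` and the toy documents are not from the original system.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Build L2-normalised TF-IDF vectors for tokenised documents."""
    n = len(documents)
    # df[t]: number of documents that contain term t.
    df = Counter(term for doc in documents for term in set(doc))
    vocabulary = sorted(df)
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        raw = [tf[t] * math.log(n / df[t]) for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in raw))
        # A document made only of terms found in every document has zero norm.
        vectors.append([w / norm for w in raw] if norm else raw)
    return vocabulary, vectors


docs = [
    ["java", "sql", "java"],
    ["sql", "management"],
    ["java", "networks"],
]
vocabulary, vectors = tfidf_vectors(docs)
print(vocabulary)  # ['java', 'management', 'networks', 'sql']
```

Terms that occur in every document get weight zero because $\log(n/df_i) = 0$, so they do not influence the comparison of CVs and vacancies.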

Figure 3 — Initial clustering

For similarity estimation the cosine similarity measure is used, defined in
equation (1):

$$\mathrm{cosine}(d_i, d_j) = \frac{\sum_{k=1}^{m} d_{ik}\, d_{jk}}{\sqrt{\sum_{k=1}^{m} d_{ik}^2}\, \sqrt{\sum_{k=1}^{m} d_{jk}^2}} \qquad (1)$$

where $d_{ik}$ and $d_{jk}$ are the components of the document vectors and $m$ is the vector
dimension. The total TF-IDF weight is then calculated to classify the CV and the vacancy.
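As an illustration of the cosine similarity measure, the sketch below compares a CV vector with two vacancy vectors. The vectors are made-up numbers and the function name `cosine` is chosen here for the example only.

```python
import math


def cosine(d_i, d_j):
    """Cosine of the angle between two document vectors of equal dimension."""
    dot = sum(a * b for a, b in zip(d_i, d_j))
    norm_i = math.sqrt(sum(a * a for a in d_i))
    norm_j = math.sqrt(sum(b * b for b in d_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # convention for empty documents, an assumption of this sketch
    return dot / (norm_i * norm_j)


cv = [0.6, 0.8, 0.0]
vacancy_a = [0.6, 0.8, 0.0]   # same term profile as the CV
vacancy_b = [0.0, 0.0, 1.0]   # no terms in common with the CV
print(cosine(cv, vacancy_b))  # 0.0
```

When the vectors are already normalised to unit length, the denominator equals one and the measure reduces to the dot product.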


2.2. Clustering the CVs using an integrated approach


Inside each cluster we define conditions for separating CVs and vacancies into
subclusters, each of which groups a subcategory. These grouping conditions are defined by
the user. In other words, each resume or vacancy is an object whose attributes are
properties of the resume or vacancy, for example the description of certain skills. At the
first stage of clustering we use a hierarchical approach [5], [6], which creates a
hierarchical decomposition of the given dataset.

We integrate hierarchical agglomeration and iterative relocation by first using a
hierarchical agglomerative algorithm with the UPGMA method [6], [7] and then refining the
result with our iterative approach [8], [9], which is similar to Chameleon clustering [10].
At its final iteration the algorithm determines the similarity between each pair of
clusters by taking into account both their relative inter-connectivity and their relative
closeness.
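The agglomerative stage with UPGMA (average linkage) can be sketched as below. This is a simplified illustration on a precomputed distance matrix, not the authors' implementation; the function name `upgma`, the stopping criterion by a target number of clusters and the example distances are assumptions.

```python
def upgma(distance, target_clusters):
    """Merge items by average linkage (UPGMA) until target_clusters remain.

    `distance` is a symmetric matrix given as a list of lists.
    """
    clusters = [[i] for i in range(len(distance))]

    def average_distance(a, b):
        # UPGMA: mean of all pairwise distances between members of a and b.
        return sum(distance[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > max(target_clusters, 1):
        # Find the pair of clusters with the smallest average distance.
        _, x, y = min(
            (average_distance(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return sorted(sorted(c) for c in clusters)


distance = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
print(upgma(distance, 2))  # [[0, 1], [2, 3]]
```

A distance such as one minus the cosine similarity could be used to fill the matrix; the refinement step that follows in the text is not covered by this sketch.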

During the first phase of our algorithm we construct an asymmetric k-NN graph: an
edge exists between two points if one of them is among the k closest neighbours of the
other. Note that the weight of an edge connecting two objects in the k-NN graph is a
similarity measure between them, usually a simple distance measure (or a value inversely
related to their distance).
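A directed k-NN graph of this kind can be built from a similarity matrix as in the sketch below. The representation of edges as a dictionary keyed by `(i, j)` pairs and the name `knn_graph` are assumptions made for the illustration.

```python
def knn_graph(similarity, k):
    """Return directed edges {(i, j): weight} from each i to its k most similar j."""
    edges = {}
    for i, row in enumerate(similarity):
        candidates = [j for j in range(len(row)) if j != i]
        # Keep the k neighbours with the highest similarity to object i.
        nearest = sorted(candidates, key=lambda j: row[j], reverse=True)[:k]
        for j in nearest:
            edges[(i, j)] = row[j]
    return edges


similarity = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
edges = knn_graph(similarity, k=1)
print(sorted(edges))  # [(0, 1), (1, 0), (2, 3), (3, 2)]
```

The graph is asymmetric in general: object j may be among the nearest neighbours of i without i being among the nearest neighbours of j.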

We compute the weight of an edge as a weighted distance between objects. During the
coarsening phase a sequence of smaller hypergraphs is constructed. In the first stage of
the coarsening process we choose the set of vertices with maximum degree and match each of
them with a random neighbour. In the subsequent stages we visit each vertex in random order
and match it with the adjacent vertex connected by the heaviest edge. Usually the weight of
an edge connecting two nodes in a coarsened version of the graph is the number of edges in
the original graph that connect the two sets of original nodes collapsed into the two
coarse nodes. In our case we compute the weight of a hyperedge as the sum of the weights of
all edges that collapse onto each other during the coarsening step. We stop the coarsening
process at each level as soon as the number of multivertices of the resulting coarse
hypergraph has been reduced by a constant factor less than two.
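One coarsening pass with heaviest-edge matching can be sketched as follows. The sketch works on an ordinary undirected weighted graph and covers only the later stages described above (random visiting order, match along the heaviest edge, summed weights of collapsed edges); the first-stage matching of maximum-degree vertices and the stopping rule are left out. The data layout and the name `coarsen_once` are assumptions.

```python
import random


def coarsen_once(edges, vertices, seed=0):
    """One heavy-edge matching pass; edges is {frozenset({u, v}): weight}."""
    rng = random.Random(seed)
    order = list(vertices)
    rng.shuffle(order)
    group = {}  # vertex -> multivertex (a tuple of merged vertices)
    for u in order:
        if u in group:
            continue
        # Unmatched neighbours of u, with the weight of the connecting edge.
        options = [
            (w, v)
            for e, w in edges.items() if u in e
            for v in e - {u} if v not in group
        ]
        if options:
            _, v = max(options)  # neighbour reached via the heaviest edge
            group[u] = group[v] = tuple(sorted((u, v)))
        else:
            group[u] = (u,)
    coarse = {}
    for e, w in edges.items():
        u, v = tuple(e)
        if group[u] != group[v]:
            key = frozenset({group[u], group[v]})
            # Edges that collapse onto each other add up their weights.
            coarse[key] = coarse.get(key, 0) + w
    return set(group.values()), coarse


edges = {
    frozenset({0, 1}): 5, frozenset({1, 2}): 1,
    frozenset({2, 3}): 5, frozenset({0, 3}): 1,
}
multivertices, coarse = coarsen_once(edges, [0, 1, 2, 3])
print(sorted(multivertices))  # [(0, 1), (2, 3)]
```

In this example the two light edges between the pairs collapse into a single coarse edge of weight 2, which is the summed weight described in the text.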
At the next level of the algorithm we produce a set of small hypergraphs using the k-way
multilevel paradigm [11]. We start partitioning by choosing the k heaviest multivertices,
where k = 8, 16, 32. We then gather, one by one, all neighbours of each chosen vertex and
obtain the initial partitioning with respect to the balancing constant. The problem of
computing an optimal bisection of a hypergraph is NP-hard. One of the most commonly used
objective functions is to minimize the hyperedge cut of the partitioning, i.e. the total
number of hyperedges that span multiple partitions [11]. In our experiments we use the
greedy refinement algorithm developed by George Karypis [11], but as the gain function for
each vertex we compute the difference between the sum of the weights of edges incident on
the vertex that go to the other partition and the sum of the weights of edges that stay
within the partition. We choose the vertex with the maximum gain and move it only if the
gain is positive, so we work only with boundary vertices.
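The gain function described above can be illustrated for the simplest case of two partitions. This is a sketch under that assumption (with more partitions the target partition of the move would also have to be chosen); the edge layout and the names `vertex_gain` and `best_move` are hypothetical.

```python
def vertex_gain(vertex, partition, edges):
    """Weight of edges leaving the vertex's part minus weight of edges inside it."""
    external = internal = 0.0
    for (u, v), w in edges.items():
        if vertex not in (u, v):
            continue
        other = v if u == vertex else u
        if partition[other] == partition[vertex]:
            internal += w
        else:
            external += w
    return external - internal


def best_move(partition, edges):
    """Return (vertex, gain) with the largest positive gain, or None."""
    gains = {v: vertex_gain(v, partition, edges) for v in partition}
    vertex = max(gains, key=gains.get)
    return (vertex, gains[vertex]) if gains[vertex] > 0 else None


edges = {("a", "b"): 1.0, ("a", "c"): 4.0, ("c", "d"): 3.0, ("b", "d"): 1.0}
partition = {"a": 0, "b": 0, "c": 1, "d": 1}
print(best_move(partition, edges))  # ('a', 3.0)
```

A vertex with no edge to the other part has a non-positive gain, which is why only boundary vertices are ever moved.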

After partitioning the hypergraph into a large number of small parts we start to
merge pairs of clusters for which both the relative inter-connectivity and the relative
closeness are high. We use George Karypis's formula to compute the similarity between
sub-clusters, but replace the relative inter-connectivity with a new expression that
estimates the average weight of edges in each sub-graph and the ratio of the number of
edges that connect the two partitions to the number of edges that stay within the smaller
partition. Experimental results showed that this method is not sensitive to the value of
k and does not require a specially constructed k-nearest neighbour graph [7].

Once its subcluster has been determined, the resume is considered recognised and is
saved into the system accordingly.


[Screenshot: Career Centre web site with vacancy listings grouped by section (System analysts, System managers, Engineers, Programmers)]

Figure 4 — Clustering of vacancies


2.3. Annotation of the candidate’s CV


Text processing systems use different approaches to text annotation. The most
widespread is a list of keywords. It is simple to implement but lacks descriptiveness.
Another way is automatic abstract construction, which gives a sufficiently clear abstract but
is algorithmically complicated. Considering the problem domain, we propose to annotate the
candidate's CV (among other methods) by adding a subset of attributes from the blocks common
to all CVs, such as the skills and abilities of candidates.

In our experiments 200 resumes were manually selected and marked up. The markup
identified the blocks "Education", "Experience" and "Other". The sections "Contact
information", "Hobby" etc. were assigned to the block "Other". In this part of the
experiments we wanted to create a list of keywords for each of the blocks described
above. In the first preprocessing step, non-printable symbols, stop words, markup
symbols, superfluous blanks, numbers and abbreviations were deleted. The second step is
stemming. We use the Porter stemming algorithm adapted for the Russian language [12].
It is simple to implement, as it is built on heuristic rules for truncating words and
does not require dictionary support. It does make errors on atypical words, but this
happens seldom and does not influence the final result. After normalization the word is
placed in the list of keywords for the given block, and the frequency of occurrence of
each generated keyword is calculated. Sometimes stemming goes wrong and strings shorter
than three characters appear. As these strings are not semantic carriers of the block,
they can be removed. As a result of the second step we obtain the list of keyword stems
with their frequency of occurrence in the block text.
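The two preprocessing steps can be sketched as a small pipeline. The stop-word list and the `toy_stem` function below are deliberately simplistic stand-ins (the paper uses a Porter stemmer adapted for Russian [12], which is not reproduced here), and the removal of abbreviations is omitted; all names are hypothetical.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "and", "of", "in", "at"}  # illustrative list only


def toy_stem(word):
    """Stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def block_keywords(text, stem=toy_stem):
    """Return stem frequencies for the text of one CV block."""
    # Keep alphabetic tokens only, which drops digits and punctuation.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    stems = (stem(t) for t in tokens if t not in STOP_WORDS)
    # Stems shorter than three characters carry no meaning for the block.
    return Counter(s for s in stems if len(s) >= 3)


education = "Studied programming at the university in 2004, programmed in Java"
print(block_keywords(education).most_common(1))  # [('programm', 2)]
```

Passing the stemmer as a parameter keeps the sketch independent of the language of the CVs; any function that maps a word to its stem can be plugged in.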

After forming these lists it turned out that one word can belong to several lists, in
which case it is not a unique characteristic of a block. To maintain uniqueness the
intersections between keyword lists have to be removed. The frequencies of such words were
compared: the word remains in the group where it occurs more frequently and is removed from
the groups where it occurs less frequently. Thus we obtained non-intersecting sets of
keywords with frequency characteristics for each group.
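The rule for removing intersections can be written down directly. In this sketch the keyword lists are dictionaries mapping stems to frequencies; the tie-breaking choice (a stem with equal frequencies stays in the block seen first) is an assumption, since the text does not discuss ties.

```python
def make_disjoint(keyword_lists):
    """Keep each stem only in the block where its frequency is highest.

    `keyword_lists` maps block name -> {stem: frequency}.
    """
    owner = {}
    for block, freqs in keyword_lists.items():
        for stem, freq in freqs.items():
            # On equal frequencies the stem stays with the earlier block.
            if stem not in owner or freq > keyword_lists[owner[stem]][stem]:
                owner[stem] = block
    return {
        block: {s: f for s, f in freqs.items() if owner[s] == block}
        for block, freqs in keyword_lists.items()
    }


lists = {
    "Education": {"univers": 40, "project": 5},
    "Experience": {"project": 30, "univers": 3},
    "Other": {"hobby": 7},
}
print(make_disjoint(lists)["Experience"])  # {'project': 30}
```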

Practical experiments have shown that using keyword stems alone does not give a
sufficiently exact splitting. Therefore the phrases of block headings were used to define
the borders of blocks. At the stage of manual parsing of the input files the headings were
marked up separately, and for each block a list of keywords and a list of headings were
formed in parallel. The accuracy of block separation using only heading keywords was 80 %.
If a heading border was not recognised, this information was saved in the system. Heading
analysis gives a possible splitting of the CV text into blocks. To confirm or correct this
splitting, a statistical approach based on the available keyword information is used. As a
result we obtain the text split into the blocks Education, Experience and Other. The
splitting proceeds as follows. For each word we find its normal form using the stemming
algorithm mentioned above [12]. We then search for this normal form in the keyword lists
and, if it is found, mark the word with the colour of the corresponding group.
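One simple way to turn the per-word marking into a decision for a fragment of text is a majority vote over the keyword groups. The sketch below is only an illustration of that idea; the paper does not specify how the statistical confirmation is computed, so the voting rule, the keyword sets and the name `label_line` are assumptions.

```python
from collections import Counter


def label_line(stems, keywords):
    """Return the block whose keyword set matches most stems, or None."""
    votes = Counter(
        block
        for s in stems
        for block, words in keywords.items()
        if s in words
    )
    return votes.most_common(1)[0][0] if votes else None


keywords = {
    "Education": {"univers", "diplom"},
    "Experience": {"project", "compan"},
}
print(label_line(["work", "compan", "project"], keywords))  # Experience
```

A fragment with no known stems gets no label, which corresponds to the case where the manager has to mark the block manually.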

In computer science a fact is an individual data value created or used by a business
process. The facts of our problem domain are the date of birth, marital status, dates of
admission to and graduation from one or several educational institutions, professional
skills, citizenship, languages etc. Facts are extracted according to rules built on regular
expressions [13]. These rules are formed in the training phase.
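Rule-based fact extraction with regular expressions can be sketched as follows. The two patterns, the field labels in the sample CV and the name `extract_facts` are invented for the illustration; in the system described here such rules are produced in the training phase rather than written by hand.

```python
import re

# Hypothetical extraction rules: fact name -> pattern with one capturing group.
FACT_RULES = {
    "date_of_birth": re.compile(r"Date of birth:\s*(\d{2}\.\d{2}\.\d{4})"),
    "languages": re.compile(r"Languages:\s*(.+)"),
}


def extract_facts(text, rules=FACT_RULES):
    """Apply each rule to the text and collect the first match of every fact."""
    facts = {}
    for name, pattern in rules.items():
        match = pattern.search(text)
        if match:
            facts[name] = match.group(1).strip()
    return facts


cv = "Surname: Ivanov\nDate of birth: 12.03.1986\nLanguages: English, Ukrainian\n"
print(extract_facts(cv))
# {'date_of_birth': '12.03.1986', 'languages': 'English, Ukrainian'}
```

Keeping the rules in a dictionary means a newly learned rule can be added without changing the extraction code.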

Using the CV as a common model makes it possible to reach a high quality of
classification. The proposed approach is also applied to the analysis of vacancies and makes
it possible to solve problems such as the comparison of CVs with the vacancies existing in
the system. A method for discovering the list of top resumes (i.e. those most in demand among
employers) is used.

Figure 5 — Top CVs


A pilot version of the system has been developed. An efficiency analysis on 500 CVs
has shown that the system split the text into blocks correctly in 87 % of cases and
extracted the facts correctly in 82 % of cases. The cases in which the system could not
split the text into blocks correctly were analysed; they involve atypical CV styles and
HTML tables.


Conclusion 


The idea of developing an intelligent system for supporting the student employment
process is proposed. As a virtual consultant for employment it can speed up the process of
finding a job.

The proposed approach can be used for solving classification, segmentation and fact
extraction problems in other areas connected with document circulation in recruitment
services.

A universal model of the CV and vacancy makes it possible to attain a high quality of
classification. The proposed method makes it possible to solve tasks such as annotation of
a candidate's resume and automatic comparison of resumes with existing vacancies. An
integrated clustering approach for estimating CV similarity is proposed. On its basis the
list of top current CVs is formed.